MULTIDIMENSIONAL ANALYSIS IN EVALUATING A SIMULATION
OF PARANOID THOUGHT PROCESSES

KENNETH MARK COLBY

Once a simulation model reaches a stage of intuitive adequacy based on face validity, the model builder then considers using more stringent evaluation procedures, depending on the purposes the model is intended to serve. If the model is to serve a practical application, for example as a training device, then a rough and ready approximation may be sufficient. But when the model is being proposed as a theoretical explanation of a psychological process, more is demanded of the representation than face validity.

A computer simulation model consists of a structure of hypothetical mechanisms or procedures sufficient to generate the input-output behavior under consideration. The theory embodied in the model can be made clear by statements which describe how the postulated structure reacts under various circumstances. I shall not describe a theory or model of paranoid processes here; rather, I shall concentrate on the evaluation problem, which asks the disarmingly simple question `how good is the model?' While the term `good' has many senses in ordinary language, I shall take this question to mean `how close is the correspondence between the behavior of the model and the phenomena it is intended to explain?' Turing's Test has often been suggested as an aid in answering this question for computer models, but as far as I know no one has conducted a true version of this test.

It is very easy to become confused about Turing's Test. In part this is due to Turing himself, who introduced the now-famous imitation game in a 1950 paper entitled COMPUTING MACHINERY AND INTELLIGENCE [3]. A careful reading of this paper reveals that there are actually two games proposed, the second of which is commonly called Turing's Test.

In the first imitation game two groups of judges try to determine which of two interviewees is a woman. Communication between judge and interviewee is by teletype. Each judge is initially informed that one of the interviewees is a woman and one a man who will pretend to be a woman. After the interview, the judge is asked what we shall call the woman-question, i.e., which interviewee was the woman? Turing does not say what else the judge is told, but one assumes the judge is NOT told that a computer is involved, nor is he asked to determine which interviewee is human and which is the computer. Thus, the first group of judges would interview two interviewees: a woman and a man pretending to be a woman.

The second group of judges would be given the same initial instructions, but unbeknownst to them, the two interviewees would be a woman and a computer programmed to imitate a woman. Both groups of judges play this game until sufficient statistical data are collected to show how often the right identification is made. The crucial question then is: do the judges decide wrongly AS OFTEN when the game is played with man and woman as when it is played with a computer substituted for the man? If so, then the program is considered to have succeeded in imitating a woman as well as a man imitating a woman. For emphasis we repeat: in asking the woman-question in this game, judges are not required to identify which interviewee is human and which is machine.

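Turing's criterion thus amounts to comparing two misidentification rates, one from each group of judges. A minimal sketch of such a comparison, using a pooled two-proportion z statistic; the counts are invented purely for illustration, since no such data were ever collected:

    import math

    def two_proportion_z(wrong1, n1, wrong2, n2):
        # Pooled two-proportion z statistic for comparing two misidentification rates.
        p1, p2 = wrong1 / n1, wrong2 / n2
        pooled = (wrong1 + wrong2) / (n1 + n2)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        return (p1 - p2) / se

    # Invented counts: 18 of 40 judges wrong in the man/woman game,
    # 21 of 40 wrong when a computer is substituted for the man.
    z = two_proportion_z(18, 40, 21, 40)
    print(round(z, 2))  # |z| < 1.96: no significant difference at the 5% level

If the two error rates do not differ significantly, the program is judged to imitate a woman as well as the man does.
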
Later on in his paper Turing proposes a variation of the first game. In the second game one interviewee is a man and one is a computer. The judge is asked to determine which is the man and which is the machine; this we shall call the machine-question. It is this version of the game which is commonly thought of as Turing's Test. It has often been suggested as a means of validating computer simulations of psychological processes.

In the course of testing a simulation (PARRY) of paranoid linguistic behavior in a psychiatric interview, we conducted a number of Turing-like indistinguishability tests [1]. We say `Turing-like' because none of them consisted of playing the two games described above. We chose not to play these games for a number of reasons, which can be summarized by saying that they do not meet modern criteria for good experimental design. In designing our tests we were primarily interested in learning more about developing the model. We did not believe the simple machine-question to be a useful one in serving the purpose of progressively increasing the credibility of the model, but we investigated a variation of it to satisfy the curiosity of colleagues in artificial intelligence.

In this design eight psychiatrists interviewed by teletype two `patients', one being PARRY and one being an actual hospitalized paranoid patient. The interviewers were not informed that a simulation was involved, nor were they asked to identify which was the machine. Their task was to conduct a diagnostic psychiatric interview and rate each response from the `patients' along a 0-9 scale of paranoidness, 0 meaning none and 9 the highest degree. Transcripts of these interviews, without the ratings of the interviewers, were then utilized for various experiments in which randomly selected expert judges conducted evaluations of the interview transcripts. For example, in one experiment it was found that patients and model were indistinguishable along the dimension of paranoidness.

To ask the machine-question, we sent interview transcripts, one with a patient and one with PARRY, to 100 psychiatrists randomly selected from the Directory of American Specialists and the Directory of the American Psychiatric Association. Of the 41 replies, 21 (51%) made the correct identification while 20 (49%) were wrong. Based on this random sample of 41 psychiatrists, the 95% confidence interval runs from 35.9% to 66.5%, a range which is close to chance. (Our statistical consultant was Dr. Helena C. Kraemer, Research Associate in Biostatistics, Department of Psychiatry, Stanford University.)

Psychiatrists are expert judges of patient interview behavior, but they are unfamiliar with computers. Hence we conducted the same test with 100 computer scientists randomly selected from the membership list of the Association for Computing Machinery (ACM). Of the 67 replies, 32 (48%) were right and 35 (52%) were wrong. Based on this random sample of 67 computer scientists, the 95% confidence interval ranges from 36% to 60%, again close to a chance level.

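The reported intervals are consistent with the usual normal approximation for a binomial proportion. A brief sketch reproducing the ranges quoted above (this is only a check on the arithmetic, not the original consultant's computation):

    import math

    def ci_95_percent(correct, n):
        # 95% confidence interval for a proportion (normal approximation), in percent.
        p = correct / n
        half = 1.96 * math.sqrt(p * (1 - p) / n)
        return 100 * (p - half), 100 * (p + half)

    print(ci_95_percent(21, 41))  # psychiatrists:       about 35.9% to 66.5%
    print(ci_95_percent(32, 67))  # computer scientists: about 35.8% to 59.7%

Both intervals straddle 50%, which is what chance-level discrimination looks like in samples of this size.
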
Thus the answer to this machine-question, "can expert judges, psychiatrists and computer scientists, using teletyped transcripts of psychiatric interviews, distinguish between paranoid patients and a simulation of paranoid processes?", is "No". But what do we learn from this? It is some comfort that the answer was not "yes" and that the null hypothesis (no differences) failed to be rejected, especially since statistical tests are somewhat biased in favor of rejecting the null hypothesis [2]. Yet this answer does not tell us what we would most like to know, i.e. how to improve the model. Simulation models do not spring forth in a complete, perfect and final form; they must be gradually developed over time. Perhaps we might obtain a "yes" answer to the machine-question if we allowed a large number of expert judges to conduct the interviews themselves rather than studying transcripts of other interviewers. Such an answer would indicate that the model must be improved, but unless we systematically investigated how the judges succeeded in making the discrimination we would not know what aspects of the model to work on. The logistics of such a design are immense, and obtaining a large N of judges for sound statistical inference would require an effort disproportionate to the information-yield.

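To give a rough sense of that effort, here is an illustrative calculation with assumed figures, not numbers from any of our studies: detecting even a moderate departure from chance (say, 65% correct identification rather than 50%) at the 5% level with 80% power already calls for on the order of 85 interviewing judges.

    import math

    def judges_needed(p1, p0=0.5, z_alpha=1.96, z_beta=0.84):
        # Approximate number of judges needed to distinguish a correct-identification
        # rate p1 from the chance rate p0 (two-sided 5% test, 80% power).
        num = z_alpha * math.sqrt(p0 * (1 - p0)) + z_beta * math.sqrt(p1 * (1 - p1))
        return math.ceil((num / (p1 - p0)) ** 2)

    print(judges_needed(0.65))  # about 85 judges, each conducting a live interview
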
A more efficient and informative way to use Turing-like tests is to ask judges to make ordinal ratings along scaled dimensions from teletyped interviews. We shall term this approach asking the dimension-question. One can then compare the scaled ratings received by the patients and by the model to determine precisely where and by how much they differ. Model builders strive for a model which shows indistinguishability along some dimensions and distinguishability along others. That is, the model converges on what it is supposed to simulate and diverges from what it is not.

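As an illustration of what the dimension-question yields, here is a minimal sketch comparing mean ratings dimension by dimension. The dimension names are taken from the study described below; the ratings themselves are invented for the example:

    from statistics import mean

    # Hypothetical judges' ratings; real values would come from the rated transcripts.
    ratings = {
        "linguistic noncomprehension": {"model": [6, 7, 5, 6], "patients": [1, 2, 2, 1]},
        "delusions":                   {"model": [3, 2, 4, 3], "patients": [7, 8, 6, 7]},
        "fear":                        {"model": [5, 4, 5, 6], "patients": [5, 5, 4, 6]},
    }

    for dimension, groups in ratings.items():
        difference = mean(groups["model"]) - mean(groups["patients"])
        print(f"{dimension:30s} model minus patients: {difference:+.2f}")

A significance test applied to each such difference then tells the model builder which divergences are worth acting on.
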
We mailed paired-interview transcripts to another 400 randomly selected psychiatrists, asking them to rate the responses of the two `patients' along certain dimensions. The judges were divided into groups, each judge being asked to rate the responses of each I-O pair in the interviews along four dimensions. The total number of dimensions in this test was twelve: linguistic noncomprehension, thought disorder, organic brain syndrome, bizarreness, anger, fear, ideas of reference, delusions, mistrust, depression, suspiciousness and mania. These are dimensions which psychiatrists commonly use in evaluating patients.

Table 1 shows there were significant differences, with PARRY receiving higher scores along the dimensions of linguistic noncomprehension, thought disorder, bizarreness, anger, mistrust and suspiciousness. On the dimension of delusions the patients were rated significantly higher. There were no significant differences along the dimensions of organic brain syndrome, fear, ideas of reference, depression and mania.

While tests asking the machine-question indicate indistinguishability at the gross level, a study of the finer structure of the model's behavior through ratings along scaled dimensions shows statistically significant differences between patients and model. These differences help the model builder by suggesting which aspects of the model must be modified and improved if it is to be considered an adequate simulation of the class of paranoid patients it is intended to simulate. For example, it is clear that PARRY'S language comprehension must be improved. Once this has been implemented, a future test will tell us whether improvement has occurred and by how much in comparison to the earlier version. Successive identification of particular areas of failure in the model permits their improvement and the development of more adequate model-versions.

Further evidence that the machine-question is too coarse and insensitive a test comes from the following experiment. In this test we constructed a random version of the paranoid model which utilized PARRY'S output statements but expressed them randomly, no matter what the interviewer said. Two psychiatrists conducted interviews with this model, transcripts of which were paired with patient interviews and sent to 200 randomly selected psychiatrists, who were asked both the machine-question and the dimension-question. Of the 69 replies, 34 (49%) were right and 35 (51%) were wrong. Based on this random sample of 69 psychiatrists, the 95% confidence interval ranges from 39% to 63%, again indicating a chance level. However, as shown in Table 2, significant differences appear along the dimensions of linguistic noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY rated higher. On these particular dimensions we can construct a continuum in which the random version represents one extreme and the actual patients the other. Our (nonrandom) PARRY lies somewhere between these two extremes, indicating that it performs significantly better than the random version but still requires improvement before being indistinguishable from patients (see Fig. 1). Hence this approach provides yardsticks for measuring the adequacy of this or any other dialogue simulation model along the relevant dimensions.

(Insert comparison of dimensions between PARRY and RANDOM-PARRY)

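One simple way to turn this continuum into a numerical yardstick, sketched here with invented mean ratings rather than the figures behind Table 2, is to express PARRY'S mean score on a dimension as its position between the two extremes:

    def position_on_continuum(random_mean, parry_mean, patient_mean):
        # 0.0: the model scores like RANDOM-PARRY on this dimension;
        # 1.0: it is indistinguishable from the patients.
        return (parry_mean - random_mean) / (patient_mean - random_mean)

    # Invented mean ratings for, say, linguistic noncomprehension (higher = worse).
    print(position_on_continuum(random_mean=7.0, parry_mean=5.0, patient_mean=2.0))  # 0.4

Movement of such a number toward 1.0 in successive versions is one measure of progress along that dimension.
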
We conclude that when model builders want to conduct tests which indicate in which direction progress lies, and to obtain a measure of whether progress is being achieved, the way to use Turing-like tests is to ask expert judges to make ratings along multiple dimensions considered essential to the model. Useful tests do not prove a model; they probe it for its sensitivities. Simply asking the machine-question yields no information relevant to improving what the model builder knows is only a first approximation.


REFERENCES

[1] Colby, K.M., Hilf, F.D., Weber, S., and Kraemer, H.C. Turing-like indistinguishability tests for the validation of a computer simulation of paranoid processes. ARTIFICIAL INTELLIGENCE, 3 (1972), 199-221.

[2] Meehl, P.E. Theory testing in psychology and physics: a methodological paradox. PHILOSOPHY OF SCIENCE, 34 (1967), 103-115.

[3] Turing, A. Computing machinery and intelligence. Reprinted in: COMPUTERS AND THOUGHT (Feigenbaum, E.A. and Feldman, J., eds.). McGraw-Hill, New York, 1963, pp. 11-35.